🎬 Netflix Content Analysis & Interactive Visualization Project

Fundamentals of Data Visualization – Final Project

Tools: Python, Pandas, Altair, Jupyter Notebook
Dataset: Netflix Movies & TV Shows (Kaggle)


📌 Project Overview

This project analyzes the Netflix Titles dataset through data cleaning, feature engineering, and interactive visualizations. The goal is to understand how Netflix's catalog has evolved over time, how genres and ratings are distributed, which countries produce the most content, and how long it takes titles to appear on Netflix after their original release.

The notebook includes:

  • Data cleaning & preprocessing
  • Creation of a 3,000-row analysis subset
  • Genre analysis & heatmaps
  • Country-level comparisons
  • Growth of Netflix content over time
  • Lag analysis (release year → year added)
  • Final summary & insights

All visualizations use Altair and are fully interactive when run locally or viewed through NBViewer.

Task & Goal Identification:
The project focuses on questions such as:

  1. How has Netflix’s catalog grown over time?
  2. Which genres dominate the platform?
  3. How do content ratings differ across genres?
  4. Which countries produce the most Netflix titles?
  5. How long does it take for titles to be added to Netflix after release?

Overall, the project provides a comprehensive and interactive look at how Netflix’s content library is structured and how it has evolved, while demonstrating thoughtful design decisions and data-cleaning practices.

In [1]:
import pandas as pd
import altair as alt

# Global chart sizing
CHART_WIDTH = 750
CHART_HEIGHT = 350

alt.themes.enable('default')
alt.data_transformers.disable_max_rows()  # allow > 5k rows if needed
Out[1]:
DataTransformerRegistry.enable('default')

📥 1. Load the Dataset

The original dataset contains 8,807 rows and was sampled down to 3,000 rows to meet performance recommendations for Altair visualizations.

In this section, we load the full dataset to create the subset.

In [2]:
df_full = pd.read_csv("data/netflix_titles.csv")

# Quick data overview: info() prints dtypes and non-null counts;
# the last expression (missing values per column) is displayed as Out[2]
df_full.info()
df_full.isna().sum()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB
Out[2]:
show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64
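
The missing-value counts above are easier to compare as percentages of all rows. The following is a minimal sketch on a small invented frame rather than on df_full; the helper name missing_share is hypothetical and not part of the notebook.

```python
import pandas as pd

def missing_share(frame: pd.DataFrame) -> pd.Series:
    """Percentage of missing values per column, largest first."""
    return (frame.isna().mean() * 100).round(1).sort_values(ascending=False)

# Tiny illustrative frame; the values are invented for the example
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", None, None, "Y"],
    "country": ["US", "IN", None, "UK"],
})

share = missing_share(toy)
print(share)
```

Applied to the counts in Out[2], the same arithmetic gives roughly 30% missing for director (2,634 of 8,807 rows) and under 10% for cast and country.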

🎯 2. Create 3,000-Row Subset

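Altair embeds the data in each chart specification and, by default, raises an error for datasets larger than 5,000 rows, so a smaller sample keeps the charts light and responsive.
We generate a random sample of 3,000 rows using a fixed random seed for reproducibility.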

In [3]:
# A fixed random seed (here 42) makes sample() pick the same rows on every run,
# which ensures reproducibility
df_subset = df_full.sample(n=3000, random_state=42)

# Save subset inside the repo's data folder
df_subset.to_csv("data/netflix_titles_subset_3000.csv", index=False)
df_subset.head()
Out[3]:
show_id type title director cast country date_added release_year rating duration listed_in description
4970 s4971 Movie Game Over, Man! Kyle Newacheck Adam DeVine, Anders Holm, Blake Anderson, Utka... United States March 23, 2018 2018 TV-MA 102 min Action & Adventure, Comedies Three buddies with big dreams go from underach...
3362 s3363 Movie Arsenio Hall: Smart & Classy Brian Volk-Weiss Arsenio Hall United States October 29, 2019 2019 TV-MA 63 min Stand-Up Comedy In his first stand-up special, Arsenio Hall di...
5494 s5495 TV Show Kazoops! NaN Reece Pockney, Scott Langley, Alex Babic, Gemm... Australia May 5, 2017 2017 TV-Y 3 Seasons Kids' TV Music meets imagination in this inventive anim...
1688 s1689 TV Show We Are the Champions NaN NaN United States November 17, 2020 2020 TV-MA 1 Season Docuseries, Reality TV Explore an array of unique competitions, from ...
1349 s1350 TV Show Pablo Escobar, el patrón del mal NaN Andrés Parra, Angie Cepeda, Cecilia Navia, Vic... Colombia February 3, 2021 2012 TV-MA 1 Season Crime TV Shows, International TV Shows, Spanis... From his days as a petty thief to becoming hea...
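
To illustrate what the fixed seed buys us, here is a small sketch on an invented 100-row frame (toy_full stands in for df_full). It checks that two draws with the same random_state select the same rows in the same order.

```python
import pandas as pd

# Stand-in for df_full: 100 numbered rows (invented data)
toy_full = pd.DataFrame({"show_id": [f"s{i}" for i in range(1, 101)]})

# Two independent draws with the same seed select the same rows in the same order
first = toy_full.sample(n=10, random_state=42)
second = toy_full.sample(n=10, random_state=42)
same_rows = first.index.equals(second.index)

# sample() keeps the original row labels, so the subset's index is not 0..n-1
keeps_labels = bool(first.index.isin(toy_full.index).all())
print(same_rows, keeps_labels)
```

This is also why Out[3] shows index values such as 4970 and 3362: they are the row labels from the full dataset. Saving with index=False drops them from the CSV.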

🧼 3. Data Cleaning

Before analysis and visualization, the subset is cleaned and a few derived fields are added. These steps standardize formats and create variables that are easier to group and plot:

  • Clean whitespace
  • Convert date_added to datetime
  • Extract year_added
  • Split duration into numeric and type fields
  • Extract the primary country
  • Extract the main genre (first genre listed)

📌 1. Handling Whitespace and Text Formatting

Many text fields (e.g., title, director, cast, country, listed_in) contained trailing or leading whitespace. To ensure clean categorical grouping and avoid mismatched categories, whitespace was removed from all string-based columns.

Reason:

  • Ensures consistent grouping
  • Prevents categories like "Drama" and "Drama " from being treated as different

In [4]:
df = df_subset.copy()

# Strip whitespace
str_cols = df.select_dtypes(include="object").columns
for col in str_cols:
    df[col] = df[col].str.strip()
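
A quick sketch of why this matters for grouping, using invented values rather than the Netflix data: two of the entries differ only by surrounding whitespace, and value_counts treats them as separate categories until they are stripped.

```python
import pandas as pd

# Invented values: some differ only by surrounding whitespace
genres = pd.Series(["Drama", "Drama ", " Comedy", "Comedy", None])

before = genres.value_counts()              # "Drama" and "Drama " counted separately
after = genres.str.strip().value_counts()   # whitespace variants merged

print(len(before), len(after))
```

Missing values pass through .str.strip() unchanged, so the cleaning step does not turn them into strings.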

📌 2. Parsing date_added and Extracting year_added

The raw date_added column contains full dates (e.g., "September 9, 2019"). This was converted to a datetime object and a new year_added field was extracted.

In [5]:
# Convert date_added

df["date_added"] = pd.to_datetime(df["date_added"], errors="coerce")
df["year_added"] = df["date_added"].dt.year.astype("Int64")
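
The two lines above rely on errors="coerce" and the nullable "Int64" dtype. The sketch below shows their effect on a few invented strings, including one missing and one unparseable entry.

```python
import pandas as pd

raw = pd.Series(["March 23, 2018", "October 29, 2019", None, "not a date"])

# errors="coerce" turns unparseable or missing entries into NaT instead of raising
parsed = pd.to_datetime(raw, errors="coerce")

# The nullable "Int64" dtype keeps whole-number years and
# represents the missing ones as <NA>
years = parsed.dt.year.astype("Int64")
print(years)
```

Without the cast to "Int64", a column containing missing years would be stored as floats (for example 2018.0).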

📌 3. Splitting duration

The duration column mixes numbers and units (e.g., "90 min", "2 Seasons"). This was split into:

  • duration_int → numerical value
  • duration_type → unit: "min", "Season", or "Seasons"

In [6]:
# Split duration
duration_split = df["duration"].str.split(" ", n=1, expand=True)
df["duration_int"] = pd.to_numeric(duration_split[0], errors="coerce")
df["duration_type"] = duration_split[1]
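
Here is the same split on a few invented values, as a sketch. The last step, which folds "Season" and "Seasons" into a single unit, is an optional normalization that the notebook does not perform; duration_unit is a hypothetical name.

```python
import pandas as pd

durations = pd.Series(["102 min", "3 Seasons", "1 Season", None])

# n=1 splits at the first space only; expand=True returns two columns (0 and 1)
parts = durations.str.split(" ", n=1, expand=True)
duration_int = pd.to_numeric(parts[0], errors="coerce")
duration_type = parts[1]

# Optional normalization (not in the notebook): treat "Seasons" as "Season"
duration_unit = duration_type.replace({"Seasons": "Season"})
print(pd.DataFrame({"value": duration_int, "unit": duration_unit}))
```

Because some durations are missing, duration_int is stored as a float column, which is why Out[8] shows values such as 102.0.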

📌 4. Extracting Primary Country

Some titles list multiple countries (e.g., "United States, Canada"). A simplified country_primary field was created using the first listed country.

In [7]:
# Extract primary country
def get_first_country(x):
    """Return the first country in a comma-separated list, or None if missing."""
    if pd.isna(x):
        return None
    return x.split(",")[0].strip()

df["country_primary"] = df["country"].apply(get_first_country)
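
The same extraction can also be written with pandas string methods instead of apply. This is a sketch of that alternative on invented values, not the notebook's method.

```python
import pandas as pd

countries = pd.Series(["United States, Canada", "India", None, " France , Belgium"])

# Vectorized equivalent of get_first_country: split on commas,
# take the first element, then trim surrounding whitespace
primary = countries.str.split(",").str[0].str.strip()
print(primary.tolist())
```

One small difference: the apply version returns None for missing countries, while the vectorized form leaves them as NaN. Both are treated as missing by value_counts and notna.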

📌 5. Extracting Main Genre

The listed_in column often contains multiple genres, such as: "International Movies, Dramas, Thrillers"

A new field main_genre was created by selecting the first listed genre.

In [8]:
# Extract main genre
def get_main_genre(x):
    """Return the first genre in a comma-separated list, or None if missing."""
    if pd.isna(x):
        return None
    return x.split(",")[0].strip()

df["main_genre"] = df["listed_in"].apply(get_main_genre)

df.head()
Out[8]:
show_id type title director cast country date_added release_year rating duration listed_in description year_added duration_int duration_type country_primary main_genre
4970 s4971 Movie Game Over, Man! Kyle Newacheck Adam DeVine, Anders Holm, Blake Anderson, Utka... United States 2018-03-23 2018 TV-MA 102 min Action & Adventure, Comedies Three buddies with big dreams go from underach... 2018 102.0 min United States Action & Adventure
3362 s3363 Movie Arsenio Hall: Smart & Classy Brian Volk-Weiss Arsenio Hall United States 2019-10-29 2019 TV-MA 63 min Stand-Up Comedy In his first stand-up special, Arsenio Hall di... 2019 63.0 min United States Stand-Up Comedy
5494 s5495 TV Show Kazoops! NaN Reece Pockney, Scott Langley, Alex Babic, Gemm... Australia 2017-05-05 2017 TV-Y 3 Seasons Kids' TV Music meets imagination in this inventive anim... 2017 3.0 Seasons Australia Kids' TV
1688 s1689 TV Show We Are the Champions NaN NaN United States 2020-11-17 2020 TV-MA 1 Season Docuseries, Reality TV Explore an array of unique competitions, from ... 2020 1.0 Season United States Docuseries
1349 s1350 TV Show Pablo Escobar, el patrón del mal NaN Andrés Parra, Angie Cepeda, Cecilia Navia, Vic... Colombia 2021-02-03 2012 TV-MA 1 Season Crime TV Shows, International TV Shows, Spanis... From his days as a petty thief to becoming hea... 2021 1.0 Season Colombia Crime TV Shows

📊 4. Visualizations Design and Implementation

In this section, we use Altair to build interactive and static visualizations that answer the core questions about Netflix’s catalog: growth over time, genre distribution, rating patterns, country contributions, and the delay between release and being added to Netflix.

🧭 Design Justification — Focus on Clarity & Interactivity

  1. Interactive exploration
    Sliders and hover effects allow users to explore trends (such as content growth and country comparisons) without visual clutter.

  2. Consistent, Netflix-inspired colors
    Using red and dark tones creates a cohesive look that matches Netflix branding and improves readability across all charts.

  3. Simplified noisy fields
    Complex fields (ratings, multiple genres, multi-country listings) were cleaned and grouped to highlight clearer patterns in the dataset.

  4. Normalized views for fair comparison
    The stacked area chart uses proportional values to show how genre share changes over time, giving a more accurate picture than raw counts.

  5. Multiple coordinated charts
    Using bar charts, heatmaps, and interactive comparisons provides complementary perspectives and supports a deeper understanding of Netflix’s catalog.

📈 4.1 Titles by Release Year (Movies vs TV Shows)

Figure 4.1 — Growth of Netflix Titles Over Time (Interactive Slider)

This visualization counts the titles in the subset by original release year (2004 onward), split into Movies and TV Shows. An interactive slider sets the latest release year shown.

In [9]:
import altair as alt

# Filter to modern years
df_modern = df[df["release_year"] >= 2004]

# Define slider: user can choose the *maximum* year to display
year_slider = alt.binding_range(min=2004, max=int(df_modern["release_year"].max()), step=1)
year_param = alt.param("Year", value=int(df_modern["release_year"].max()), bind=year_slider)

chart_year_slider = (
    alt.Chart(df_modern)
    .mark_line(point=True)
    .encode(
        x=alt.X(
            "release_year:O",
            axis=alt.Axis(title="Release Year", labelAngle=45)
        ),
        y=alt.Y(
            "count():Q",
            title="Number of Titles"
        ),
        color=alt.Color(
            "type:N",
            title="Content Type",
            scale=alt.Scale(
                # Netflix-inspired colors: red for Movies, dark gray for TV Shows
                domain=["Movie", "TV Show"],
                range=["#E50914", "#221F1F"]
            )
        ),
        tooltip=["type", "release_year", "count()"]
    )
    .add_params(year_param)
    .transform_filter("datum.release_year <= Year")
    .properties(
        width= CHART_WIDTH,
        height= CHART_HEIGHT,
        title="Netflix Titles by Release Year (2004–Present) – Interactive"
    )
)

chart_year_slider
Out[9]:

⚠️ Note: Years before 2004 contain very few titles, which compresses the chart.

Interpretation:

This chart shows a clear upward trend in the number of titles by release year from 2004 onward. Titles released before the mid-2010s are comparatively few. Starting around 2014–2016, the number of titles per release year increases much more rapidly, which is consistent with Netflix’s global expansion and the introduction of original content. Movies dominate among earlier release years, but TV Shows become increasingly common in later years. Because the x-axis is the release year rather than the year a title was added, the chart describes the age profile of the catalog and only indirectly how the catalog grew.

The slider lets users limit the view to titles released up to a chosen year and observe how quickly the number of titles rises during the 2010s.
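
For readers who want the numbers behind the slider, the filter "datum.release_year <= Year" followed by a count per year and type can be sketched in pandas as below. The frame is invented, and counts_up_to is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Invented rows standing in for df_modern
toy = pd.DataFrame({
    "release_year": [2015, 2015, 2016, 2016, 2016, 2017],
    "type": ["Movie", "TV Show", "Movie", "Movie", "TV Show", "Movie"],
})

def counts_up_to(frame: pd.DataFrame, max_year: int) -> pd.DataFrame:
    """Titles per release year and type, keeping only years <= max_year."""
    kept = frame[frame["release_year"] <= max_year]
    return (
        kept.groupby(["release_year", "type"])
        .size()
        .reset_index(name="count")
    )

result = counts_up_to(toy, 2016)
print(result)
```

Moving the slider corresponds to changing max_year: rows for later release years are dropped, and the counts for the remaining years stay the same.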


🎭 4.2 Top 10 Netflix Genres

Figure 4.2 — Top 10 Most Common Netflix Genres

This chart shows the ten most common primary genres in the dataset. The main_genre column represents the first genre listed for each title, giving a simplified but consistent way to group categories. This visualization helps identify the genres Netflix relies on the most in its catalog.

In [10]:
# Compute top 10 genres
genre_counts = df["main_genre"].value_counts().nlargest(10).reset_index()
genre_counts.columns = ["genre", "count"]

chart_genre_counts = (
    alt.Chart(genre_counts)
    .mark_bar(color="#E50914")  # Netflix red
    .encode(
        x="count:Q",
        y=alt.Y("genre:N", sort="-x"),
        tooltip=["genre", "count"]
    )
    .properties(
        width=CHART_WIDTH,
        height=CHART_HEIGHT,
        title="Top 10 Netflix Genres"
    )
)

chart_genre_counts
Out[10]:

Interpretation:
The chart shows that Dramas and Comedies appear most frequently on Netflix, followed by Documentaries and International content. This highlights Netflix’s emphasis on broadly appealing and globally relevant genres.

💡 Insight: Dramas and Comedies consistently dominate Netflix’s library.
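
The top-10 table is built by counting values and renaming the two resulting columns. As a sketch, the same step can be wrapped in a small helper that names the columns explicitly, so it does not depend on the default names that reset_index produces (these differ between pandas versions). The helper name top_n_counts and the sample values are assumptions for illustration.

```python
import pandas as pd

def top_n_counts(series: pd.Series, n: int, label: str) -> pd.DataFrame:
    """Return the n most frequent values as a two-column frame: label, count."""
    return (
        series.value_counts()      # sorted by frequency, missing values excluded
        .head(n)
        .rename_axis(label)        # name the index so reset_index uses it
        .reset_index(name="count")
    )

genres = pd.Series(
    ["Dramas", "Comedies", "Dramas", "Documentaries", "Dramas", "Comedies", None]
)
top2 = top_n_counts(genres, 2, "genre")
print(top2)
```

The same helper would also cover the country table in section 4.5, for example top_n_counts(df["country_primary"], 15, "country").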


🎨 4.3 Genre × Rating Heatmap

Figure 4.3 — Genre vs Rating Category Heatmap

This heatmap shows how Netflix genres relate to content ratings. Darker colors represent more titles in a specific genre–rating combination. Movies and TV Shows are shown together in a single panel.

In [11]:
# Build the simplified rating group column
def simplify_rating(r):
    if r in ["TV-MA", "R", "NC-17"]:
        return "Mature"
    elif r in ["TV-14", "PG-13"]:
        return "Teen"
    else:
        # Every remaining rating (e.g. TV-Y, TV-G, PG) falls here,
        # and so does any unrated or unexpected value
        return "Family/Kids"

df_genre = df[df["main_genre"].notna() & df["rating"].notna()].copy()
df_genre["rating_group"] = df_genre["rating"].apply(simplify_rating)

# Top 10 genres (computed in section 4.2)
top_genres = genre_counts["genre"].tolist()
df_genre_top = df_genre[df_genre["main_genre"].isin(top_genres)]
In [12]:
# Use only Top 10 genres for heatmap
heatmap_simple = (
    alt.Chart(df_genre_top)
    .mark_rect()
    .encode(
        x=alt.X("rating_group:N", title="Rating Category"),
        y=alt.Y("main_genre:N", title="Main Genre", sort=top_genres),
        color=alt.Color("count():Q",
                       title="Number of Titles",
                       scale=alt.Scale(scheme="reds")),
        tooltip=["main_genre", "rating_group", "count()"]
    )
    .properties(
        width=CHART_WIDTH * 0.5,
        height=CHART_HEIGHT,
        title="Genre vs. Rating Category"
    )
)

heatmap_simple
Out[12]:

Interpretation:
This heatmap shows how different primary genres distribute across rating categories. Documentaries, Dramas, and Comedies appear most frequently across all ratings, with a particularly strong concentration in the Teen and Mature categories. Family/Kids ratings are much less common overall, indicating that Netflix's catalog skews toward older audiences.
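
The counts encoded by the heatmap colors can also be inspected as a plain table. The sketch below uses invented rows and expresses the rating grouping as a lookup dictionary (RATING_GROUPS is a hypothetical name) that mirrors simplify_rating.

```python
import pandas as pd

# Same three-way grouping as simplify_rating, written as a lookup table
RATING_GROUPS = {
    "TV-MA": "Mature", "R": "Mature", "NC-17": "Mature",
    "TV-14": "Teen", "PG-13": "Teen",
}

toy = pd.DataFrame({
    "main_genre": ["Dramas", "Dramas", "Dramas", "Comedies", "Comedies", "Kids' TV"],
    "rating": ["TV-MA", "R", "TV-14", "PG-13", "TV-MA", "TV-Y"],
})

# Ratings missing from the lookup fall back to "Family/Kids", as in the notebook
toy["rating_group"] = toy["rating"].map(RATING_GROUPS).fillna("Family/Kids")

# Rows are genres, columns are rating groups, cells are title counts
table = pd.crosstab(toy["main_genre"], toy["rating_group"])
print(table)
```

Each cell of this table corresponds to one rectangle in the heatmap. A zero cell corresponds to a combination that has no titles.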


4.4 Genre Distribution Over Time (Stacked Area)

Figure 4.4 — Relative Genre Distribution Over Time (2004–Present)

This stacked area chart shows how the share of each top genre changes over time. Stacking is normalized to 100%, so we see relative composition rather than raw counts.

In [13]:
genre_year = (
    df[df["release_year"] >= 2004]
    .groupby(["release_year", "main_genre"])
    .size()
    .reset_index(name="count")
)

genre_year = genre_year[genre_year["main_genre"].isin(top_genres)]

chart_genre_over_time = (
    alt.Chart(genre_year)
    .mark_area()
    .encode(
        x=alt.X("release_year:O", title="Release Year"),
        y=alt.Y("count:Q", stack="normalize", title="Share of Titles"),
        color=alt.Color("main_genre:N", title="Main Genre"),
        tooltip=["release_year", "main_genre", "count"]
    )
    .properties(
        width=CHART_WIDTH,
        height=CHART_HEIGHT,
        title="Relative Genre Share Over Time (Top 10 Genres)"
    )
)

chart_genre_over_time
Out[13]:

Interpretation:
This stacked area chart shows how the proportional share of top genres has shifted over time on Netflix. Dramas and Comedies maintain a consistently large share throughout the entire period, reflecting their broad global demand. Documentaries and International TV Shows grow noticeably after 2015, suggesting an increase in global content acquisition and niche audience expansion.
Because the chart is normalized, we see how genre balance changed—not just total volume—making it clear that Netflix became more genre-diverse as the platform scaled.
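
The normalization that stack="normalize" applies can be reproduced by hand: each count is divided by the total for its year. A minimal sketch on invented counts in the same shape as genre_year:

```python
import pandas as pd

# Invented counts in the same shape as genre_year
toy = pd.DataFrame({
    "release_year": [2018, 2018, 2019, 2019],
    "main_genre": ["Dramas", "Comedies", "Dramas", "Comedies"],
    "count": [30, 10, 20, 20],
})

# Divide each count by its year's total so every year sums to 1.0
year_total = toy.groupby("release_year")["count"].transform("sum")
toy["share"] = toy["count"] / year_total
print(toy)
```

Note that genre_year is filtered to the top 10 genres before plotting, so the shares in Figure 4.4 are relative to those ten genres rather than to the whole catalog.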


🌍 4.5 Top Countries Producing Netflix Titles

Figure 4.5 — Countries Producing the Most Netflix Titles

This chart shows the countries with the highest number of titles. Only the primary country listed for each title (country_primary) is used to avoid double-counting multi-country entries. The chart highlights the geographic concentration of content production on Netflix.

In [14]:
country_counts = (
    df["country_primary"].value_counts().nlargest(15).reset_index()
)
country_counts.columns = ["country", "count"]

chart_country_counts = (
    alt.Chart(country_counts)
    .mark_bar(color="#221F1F")
    .encode(
        x=alt.X("count:Q", title="Number of Titles"),
        y=alt.Y("country:N", sort="-x", title="Country"),
        tooltip=["country", "count"]
    )
    .properties(
        width=CHART_WIDTH,
        height=CHART_HEIGHT,
        title="Top 15 Countries Producing Netflix Titles"
    )
)

chart_country_counts
Out[14]:

Interpretation:
The United States and India dominate Netflix's catalog, contributing far more titles than any other country. The next highest contributors—such as the United Kingdom, Japan, Canada, and South Korea—represent strong regional production hubs. The chart highlights how Netflix’s content library is heavily influenced by Hollywood and Bollywood, with growing representation from Asian and European markets.


4.6 Alternative Country View: Interactive Highlight Bar Chart

Figure 4.6 — Interactive Country Comparison (Hover to Highlight)

This interactive chart provides a clearer comparison between countries by allowing users to hover over each bar and temporarily highlight it. This makes it easier to focus on individual countries and compare their contribution to Netflix’s catalog without visual clutter.

In [15]:
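# empty=False: while no bar is hovered, nothing counts as selected,
# so all bars keep the default dark gray
highlight = alt.selection_point(on='mouseover', fields=['country'], empty=False)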

chart_country_highlight = (
    alt.Chart(country_counts)
    .mark_bar()
    .encode(
        x="count:Q",
        y=alt.Y("country:N", sort="-x"),
        color=alt.condition(
            highlight,
            alt.value("#E50914"),   # highlight in Netflix red
            alt.value("#221F1F")    # default dark gray
        ),
        tooltip=["country", "count"]
    )
    .add_params(highlight)
    .properties(
        width=CHART_WIDTH,
        height=CHART_HEIGHT,
        title="Interactive Country Comparison"
    )
)

chart_country_highlight
Out[15]:

Interpretation:
This interactive version allows users to hover over bars to highlight one country at a time, making comparison easier than in the static chart. The United States and India remain the clear leaders, but the interaction helps reveal subtler differences among mid-range contributors like the UK, Japan, Canada, and South Korea. This chart is particularly useful when examining how countries compare individually without visual clutter.


4.7 Release Year vs Year Added Histogram

Figure 4.7 — Lag Between Release Year and Netflix Addition

This histogram shows how long it takes for titles to appear on Netflix after their original release.

In [16]:
# Compute the lag in years
lag_df = df[df["year_added"].notna()].copy()
lag_df["lag_years"] = lag_df["year_added"] - lag_df["release_year"]
In [17]:
lag_hist = (
    alt.Chart(lag_df)
    .mark_bar()
    .encode(
        x=alt.X("lag_years:Q",
                bin=alt.Bin(step=1),
                title="Years Between Release and Added to Netflix"),
        y=alt.Y("count():Q", title="Number of Titles"),
        color=alt.Color("type:N",
                        title="Type",
                        scale=alt.Scale(domain=["Movie", "TV Show"],
                                        range=["#E50914", "#221F1F"])),
        tooltip=["lag_years", "count()"]
    )
    .properties(
       width=CHART_WIDTH,
       height=CHART_HEIGHT,
       title="Distribution of Delay Between Release and Being Added to Netflix"
    )
)

lag_hist
Out[17]:

Interpretation:
The distribution of lag years shows that most titles are added to Netflix relatively soon after their original release, with the highest concentration occurring between 0 and 5 years. The frequency drops steadily as lag increases, suggesting that Netflix prioritizes acquiring content that is recent or still culturally relevant. Very large lag values are uncommon, indicating that older titles are added less frequently.
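
Statements such as "most titles fall between 0 and 5 years" can be checked with a few summary numbers. The sketch below uses invented rows; on the real data the same expressions would be applied to lag_df. The last invented row has a negative lag, a case worth checking for, since it means the recorded add date precedes the release year.

```python
import pandas as pd

# Invented rows; the last title has an "added" year earlier than its release year
toy = pd.DataFrame({
    "release_year": [2018, 2019, 2012, 1995, 2021],
    "year_added":   [2018, 2020, 2021, 2019, 2020],
})
toy["lag_years"] = toy["year_added"] - toy["release_year"]

# Share of titles added within 0 to 5 years of release (both ends inclusive)
within_5 = toy["lag_years"].between(0, 5).mean()

# Negative lags: recorded add year is earlier than the release year
negative = int((toy["lag_years"] < 0).sum())
median_lag = toy["lag_years"].median()
print(within_5, negative, median_lag)
```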


5. Evaluation

Purpose of the Evaluation

The goal of this evaluation was to determine whether the visualizations effectively supported the main analytical questions of the project—specifically, understanding trends in Netflix content growth, genre patterns, country contributions, and the lag between release year and Netflix addition.

Participant Recruitment

Participants were recruited from classmates, friends, and coworkers who regularly use Netflix but do not specialize in data visualization. This group represents typical streaming users and is appropriate for evaluating whether the visualizations communicate insights clearly to a general audience.

Evaluation Procedure

Participants interacted with the visualization report and completed a small set of tasks, such as:

  • identifying the most common genres,
  • determining which countries contribute the most titles,
  • interpreting growth trends over time,
  • explaining the lag histogram,
  • and using the interactive slider or hover features.

During the session, they provided feedback on clarity, usability, and how intuitive the interactive elements were.

Measurement Criteria

The evaluation considered several measures:

  • Insight depth: whether participants could derive meaningful insights from each chart.
  • Accuracy: whether participants answered task questions correctly.
  • Use cases: whether the visualizations supported exploration of the dataset in meaningful ways.
  • Usability: ease of navigation, legibility, and clarity of labels and color choices.
  • Engagement: whether participants explored beyond the required tasks using interaction tools.

Assessment of Feedback

Participants reported that the visualizations were clear, visually appealing, and easy to navigate. The interactive elements (slider and highlighting) were particularly effective in helping them explore trends without feeling overwhelmed. Some participants suggested adding short explanatory notes under each chart, which led to the interpretation text now included below every figure. Overall, the feedback confirmed that the visualizations successfully communicated the intended insights and were accessible to non-expert users.

📌 6. Conclusion & Final Summary

This project analyzed a subset of Netflix titles to uncover patterns in content growth, genre distribution, ratings, geographic representation, and release timelines. Through a combination of data cleaning, feature engineering, and interactive visualization, several meaningful insights emerged.

📌 Growth Over Time

Netflix’s catalog has grown steadily, with a sharp increase in both Movies and TV Shows after the mid-2010s. The interactive slider highlights how quickly the platform expanded its offerings year by year.

📌 Genre Composition

Genre analysis revealed that Dramas, Comedies, Documentaries, and International TV Shows are among the most dominant genres. Visualization of the top 10 genres shows a clear skew toward story-driven and international content.

📌 Rating Patterns by Genre

By grouping content ratings into Family/Kids, Teen, and Mature categories, the heatmap becomes easier to interpret. Most mature-rated content appears in Dramas, Crime shows, and Action genres, while Kids’ content remains highly specialized. This grouping decision improved clarity and reduced visual clutter compared to showing every individual rating.

📌 Country Contributions

The country bar charts show that the United States and India contribute the largest volume of Netflix content, with notable contributions from the United Kingdom, Japan, Canada, and South Korea. An interactive highlight chart makes country comparison more intuitive.

📌 Lag Between Release and Netflix Addition

The lag histogram reveals that most titles are added to Netflix within 0–10 years of their original release, with a noticeable concentration near low lag values. This concentration suggests that Netflix mostly acquires or produces content close to its release year, especially for TV shows.

🟩 What Worked Well

  • Interactive elements (slider, highlight chart) helped uncover insights quickly.
  • Grouping complex ratings into three categories improved readability.
  • Cleaning steps such as extracting the main genre and primary country simplified analysis.
  • A consistent color palette and chart sizing created a cohesive visual narrative.

🟧 Future Improvements

Future iterations could:

  • Incorporate text mining on descriptions to extract themes or sentiment.
  • Visualize relationships between cast members or directors using network graphs.
  • Compare Netflix content with other streaming platforms for broader context.
  • Build a dashboard version using Altair, Plotly Dash, or Streamlit.

Key Insights:

  • Netflix expanded rapidly after 2015
  • Dramas, Comedies, and Documentaries dominate
  • Ratings are mostly Teen or Mature
  • The U.S. and India contribute the most content
  • Most titles appear on Netflix within 0–10 years of release

Overall Assessment:

Netflix’s catalog is diverse and global, and the pace of its content acquisition has increased.

🙏 Thank You for Reviewing My Project